Audit Sampling With Autoencoders

Welcome! In this notebook we will explore the use of autoencoder neural networks to learn low-dimensional representations of financial transactions and identify a representative sample for financial statement audits.

Acknowledgements

This notebook is largely inspired by a paper published in 2020 by Marco Schreyer, Timur Sattarov, Anita Gierbl, Bernd Reimer, and Damian Borth, entitled Learning Sampling in Financial Statement Audits using Vector Quantised Autoencoder Neural Networks, as well as an excellent demonstration of autoencoders by Alexander Van de Kleut, entitled Variational Autoencoders (VAE) with PyTorch.


Background

Auditing standards require the assessment of the underlying transactions that comprise the financial statements to detect errors or fraud that would result in material misstatement. The accounting profession has developed a framework for addressing this requirement, known as the Audit Risk Model.

The Audit Risk Model defines audit risk as the combination of inherent risk, control risk and detection risk:

$\text{Audit Risk} = \text{Inherent Risk} \times \text{Control Risk} \times \text{Detection Risk}$
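As a hypothetical illustration (the risk levels here are invented for the example): if inherent risk is assessed at 0.8 and control risk at 0.5, then holding overall audit risk to an acceptable 5% requires a detection risk of $0.05 / (0.8 \times 0.5) = 0.125$, so the auditor's substantive procedures must be designed to be correspondingly more effective.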

Detection risk is composed of sampling risk and non-sampling risk:

$\text{Detection Risk} = \text{Sampling Risk} + \text{Non-Sampling Risk}$

We focus our attention on sampling risk, which is defined as the risk that the auditor's conclusion based on the sample would be different had the entire population been tested. In other words, it is the risk that the sample is not representative of the population and does not provide sufficient appropriate audit evidence to detect material misstatements.

There are a variety of sampling methods used by auditors. Random sampling is based on each member of the population having an equal chance of being selected. Stratified sampling subdivides the population into homogenous groups from which to make selections. Monetary unit sampling treats each dollar amount from the population as the sampling unit and selects items when a cumulative total meets or exceeds a predefined sampling interval when cycling through the population.
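As a sketch of the last of these, monetary unit sampling can be implemented in a few lines. The payment amounts below are hypothetical, and starting at a random point within the first interval is one common convention:

```python
import numpy as np

def monetary_unit_sample(amounts, n_selections, seed=0):
    """Select items whose cumulative dollar total meets or exceeds each sampling interval."""
    amounts = np.asarray(amounts, dtype=float)
    interval = amounts.sum() / n_selections
    rng = np.random.default_rng(seed)
    start = rng.uniform(0, interval)            # random start within the first interval
    thresholds = start + interval * np.arange(n_selections)
    cumulative = np.cumsum(amounts)
    # index of the first item whose cumulative total reaches each threshold
    return np.unique(np.searchsorted(cumulative, thresholds))

amounts = [50, 1200, 75, 9000, 300, 40]         # hypothetical payments
picks = monetary_unit_sample(amounts, n_selections=3)
```

Because each dollar is a sampling unit, large payments are proportionally more likely to be selected (here the $9,000 item dominates the population).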

Autoencoders offer an alternative method for addressing sampling risk. An autoencoder is a neural network that learns to encode data into lower dimensions and decode it back into higher dimensions. The resulting model provides a low-dimensional representation of the data, disentangling it in a way that reveals something about the fundamental structure. Auditors can model transactions in this way and select from low-dimensional clusters. They can also identify anomalous transactions based on how much they deviate from other transactions in this latent space.


The City of Philadelphia Payments Data

To demonstrate the use of autoencoders for financial transaction analysis, we will use the City of Philadelphia payments data. It is one of two datasets used in Schreyer et al. (2020) and consists of nearly a quarter-million payments from about 60 city offices, departments, boards and commissions. It covers the City's fiscal year 2017 (July 2016 through June 2017) and represents nearly $4.2 billion in payments during that period.

Let's begin by reading in the modules we'll use to analyze these data:

In [1]:
# import preprocessing modules
import time
import datetime
import calendar
import numpy as np
import pandas as pd
import datapane as dp
import plotly.express as px
import plotly.graph_objects as go
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# import pytorch modules
import torch
torch.manual_seed(0)
import torch.nn as nn
import torch.distributions
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# use GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

Read in the data:

In [2]:
df = pd.read_csv('data/city_payments_fy2017.csv')

View the first five rows:

In [3]:
df.head()
Out[3]:
fy fm check_date document_no dept department_title char_ character_title sub_obj sub_obj_title vendor_name doc_ref_no_prefix doc_ref_no_prefix_definition contract_number contract_description transaction_amount
0 2017 10 2017-04-12 CHEK17119771 42 42 COMMERCE 2 02 PURCHASE OF SERVICES 231 OVERTIME MEALS 0231 EAT AT JOE'S PVXX payment voucher NaN NaN 66.82
1 2017 12 2017-06-09 ACHD17177233 26 26 LICENSES & INSPECTIONS 2 02 PURCHASE OF SERVICES 211 TRANSPORTATION 0211 L & I  TRAVEL IMPREST FUND PVXX payment voucher NaN NaN 127.33
2 2017 5 2016-11-25 CHEK17063736 44 44 LAW 2 02 PURCHASE OF SERVICES 258 COURT REPORTERS 0258 MARLENE BELL REPORTING, INC. PVXX payment voucher NaN NaN 454.20
3 2017 1 2016-07-07 CHEK17000247 11 11 POLICE 2 02 PURCHASE OF SERVICES 260 REPAIR AND MAINTENANCE CHARGES 0260 RICOH AMERICAS CORPORATION VCXX procurement NaN NaN 50.00
4 2017 1 2016-07-08 ACHD17000233 23 23 PRISONS 3 03 MATERIALS AND SUPPLIES 313 FOOD 0313 PHILADELPHIA PRISONS PCXX petty cash NaN NaN 71.92

Check the dimensions of the data:

In [4]:
df.shape
Out[4]:
(238894, 16)

We can see that several features of this dataset are redundant, and some records have NaN ("Not a Number") values. Next, we will clean up the data, filling NaNs with "None". This is reasonable: the fact that some payments come without a contract number or description is itself informative, and we want to retain that information for modeling. We'll also remove redundant columns and add new features for amount_sign, payment_method, weekday, day, month, and year.

In [5]:
%%time
# clean up the data
df.department_title = df.department_title.str.split().apply(
    lambda x: ' '.join(x[1:len(x)]))

df.character_title = df.character_title.str.split().apply(
    lambda x: ' '.join(x[1:len(x)]))

df.sub_obj_title = df.sub_obj_title.str.split().apply(
    lambda x: ' '.join(x[0:len(x)-1]))

df['amount_sign'] = df.transaction_amount.apply(
    lambda x: 1 if x >= 0 else 0)

df.transaction_amount = df.transaction_amount.abs()

df['payment_method'] = df.document_no.str.split('1').apply(
    lambda x: x[0])

df['weekday'] = df.check_date.str.split('-').apply(
    lambda x: calendar.day_name[
    datetime.datetime(int(x[0]), 
                      int(x[1]), 
                      int(x[2])).weekday()])

df['day'] = df.check_date.str.split('-').apply(
    lambda x: x[2])

df['month'] = df.check_date.str.split('-').apply(
    lambda x: calendar.month_name[int(x[1])])

df['year'] = df.check_date.str.split('-').apply(
    lambda x: x[0])

df.drop(['fy', 'fm', 'check_date', 'dept', 'char_', 'sub_obj', 
         'doc_ref_no_prefix','contract_number'], 
        axis=1, inplace=True)

df.fillna('None', inplace=True)
CPU times: total: 6.02 s
Wall time: 6.19 s

Here is the result of this initial preprocessing:

In [6]:
df.head()
Out[6]:
document_no department_title character_title sub_obj_title vendor_name doc_ref_no_prefix_definition contract_description transaction_amount amount_sign payment_method weekday day month year
0 CHEK17119771 COMMERCE PURCHASE OF SERVICES OVERTIME MEALS EAT AT JOE'S payment voucher None 66.82 1 CHEK Wednesday 12 April 2017
1 ACHD17177233 LICENSES & INSPECTIONS PURCHASE OF SERVICES TRANSPORTATION L & I  TRAVEL IMPREST FUND payment voucher None 127.33 1 ACHD Friday 09 June 2017
2 CHEK17063736 LAW PURCHASE OF SERVICES COURT REPORTERS MARLENE BELL REPORTING, INC. payment voucher None 454.20 1 CHEK Friday 25 November 2016
3 CHEK17000247 POLICE PURCHASE OF SERVICES REPAIR AND MAINTENANCE CHARGES RICOH AMERICAS CORPORATION procurement None 50.00 1 CHEK Thursday 07 July 2016
4 ACHD17000233 PRISONS MATERIALS AND SUPPLIES FOOD PHILADELPHIA PRISONS petty cash None 71.92 1 ACHD Friday 08 July 2016

Exploratory Data Analysis (EDA)

Next, we'll explore the data to get a sense of its characteristics.

In [7]:
df.describe().round()
Out[7]:
transaction_amount amount_sign
count 238894.0 238894.0
mean 17574.0 1.0
std 436585.0 0.0
min 0.0 0.0
25% 66.0 1.0
50% 250.0 1.0
75% 1236.0 1.0
max 99000000.0 1.0

Let's have a look at the distribution of the transaction amounts using a histogram with 100 bins:

In [8]:
fig = px.histogram(df, x='transaction_amount', log_y=True, nbins=100,
                   title='Histogram of Transaction Amounts', width=900)

fig.update_traces(
    marker_line_width=1, 
    marker_line_color='black',
    marker=dict(color='gold'),
)

fig.update_xaxes(title='Transaction amount ($ millions)',
                 range=[-2000000,102000000],
                 gridcolor='lightgray',
                 showgrid=True, 
                 gridwidth=1)

fig.update_yaxes(title='Count (log-scaled)',
                 gridcolor='lightgray',
                 showgrid=True, 
                 gridwidth=1)

fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  hoverlabel=dict(
                      bgcolor='ivory',
                      font_size=16,
                      font_family='Rockwell'),
                  font=dict(family='Rockwell', 
                            color='navy',
                            size=16), 
                  title_x=0.5)
fig.show()

It looks like the majority of transactions are $500,000 or less. It is also apparent that there are several outlier payments with relatively high dollar amounts. Next, let's look at the spend by city department:

In [9]:
top_twenty_depts = (df.department_title.value_counts()\
                    / df.department_title.value_counts().sum()\
                    * 100).head(20)[::-1]

top_twenty_depts = top_twenty_depts.reset_index().rename(
    columns={'department_title':'share',
             'index':'department'})

fig = px.bar(top_twenty_depts, x='share', y='department',
             title='Top Twenty Departments By Share of Spend',
             width=700, orientation='h')

fig.update_traces(marker_line_width=1, 
                  marker_line_color='black',
                  marker=dict(color='lightgreen'),
                  hovertemplate='<br>'.join([
                      'Department: %{y}',
                      'Share: %{x:.2f}%']))

fig.update_xaxes(title='Share of spend',
                 range=[-0.25,23],
                 gridcolor='lightgray',
                 showgrid=True, 
                 gridwidth=1, 
                 tickvals=[0,5,10,15,20],
                 ticktext=[f'{x}%' for x in range(0,21,5)])

fig.update_yaxes(title='',
                 gridcolor='lightgray',
                 showgrid=True, 
                 gridwidth=1)

fig.update_layout(title_font_size=20,
                  paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  showlegend=False,
                  hoverlabel=dict(
                      bgcolor='ivory',
                      font_size=16,
                      font_family='Rockwell'),
                  font=dict(family='Rockwell', 
                            color='navy',
                            size=12), 
                  title_x=0.5)
fig.show()

Of the 58 city departments included in the dataset, about 60% of payments are attributable to fleet management, managing director, human resources, water and health. What about the nature of these expenses? The document reference feature sheds some light on this:

In [10]:
doc_ref_df = df.doc_ref_no_prefix_definition.value_counts().reset_index()
doc_ref_df.doc_ref_no_prefix_definition = doc_ref_df.doc_ref_no_prefix_definition\
/ doc_ref_df.doc_ref_no_prefix_definition.sum() * 100

fig = px.pie(doc_ref_df, names='index', values='doc_ref_no_prefix_definition',
             labels='index', title='Spend By Document Reference'+\
             '<br><sup>hover over to see labels</sup>',
             hole=0.5, color_discrete_sequence=px.colors.qualitative.Set3, 
             width=500, height=500).update_traces(marker_line_color='rgb(0,0,0)',
                                                  textinfo='none',
                                                  hovertemplate='<br>'.join([
                                                      'doc ref:  %{label}',
                                                      'share of spend:  '+\
                                                      '%{value:.2f}%']),
                                                  textfont_size=14,
                                                  marker_line_width=1)
fig.update_layout(title_font_size=20,
                  showlegend=False,
                  hoverlabel=dict(
                      font_size=16,
                      font_family='Rockwell'),
                  font=dict(family='Rockwell', 
                            color='navy',
                            size=12), 
                  title_x=0.5)
fig.show()

Hovering over the doughnut plot above, we find the following breakdown of expenses by document reference:

  • 39% procurement
  • 17% auto parts
  • 15% professional services
  • 12% court appointed attorneys
  • … 

What days do payments tend to be made on?

In [12]:
day_counts = df.day.value_counts().sort_index().reset_index()
day_counts.rename(columns={'index':'day','day':'count'}, inplace=True)

fig = px.line(day_counts, x='day', y='count', 
              title='Frequency of Transactions by Day of the Month')

fig.update_traces(mode='markers+lines')

fig.update_xaxes(gridcolor='lightgray',
                 showgrid=True, 
                 gridwidth=1)

fig.update_yaxes(gridcolor='lightgray',
                 showgrid=True, 
                 gridwidth=1)

fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  title_font_size=20,
                  showlegend=False,
                  hoverlabel=dict(
                      bgcolor='ivory',
                      font_size=16,
                      font_family='Rockwell'),
                  font=dict(family='Rockwell', 
                            color='navy',
                            size=14), 
                  title_x=0.5)

fig.show()

While payments are made daily, it looks like there are four times in a month when they are more frequent. This might coincide with the payroll cycle or reflect vendor payment terms (e.g., 2/10 net 30).

Extract, Transform & Load (ETL)

Now that we have a sense of our dataset, let's prepare it for modeling. Note that the ETL steps in this section are collected in the preprocessing.py module provided in this repository.

We will need our data to be embedded in a vector space so we can apply numerical analysis to it. Let's begin by getting the features that are categorical and numerical:

In [13]:
# determine categorical and numerical features
numerical = df.select_dtypes(include=['int64', 'float64']).columns
categorical = df.select_dtypes(include=['object', 'bool']).columns
In [14]:
print(numerical)
print(categorical)
Index(['transaction_amount', 'amount_sign'], dtype='object')
Index(['document_no', 'department_title', 'character_title', 'sub_obj_title',
       'vendor_name', 'doc_ref_no_prefix_definition', 'contract_description',
       'payment_method', 'weekday', 'day', 'month', 'year'],
      dtype='object')

Create a dictionary to store the feature names to help with plotting later:

In [19]:
# get feature names for plotting
feature_names = dict(zip(range(len(categorical) +\
                               len(numerical)-1),
                         categorical.tolist()[1:] +\
                         numerical.tolist()))
In [20]:
feature_names
Out[20]:
{0: 'department_title',
 1: 'character_title',
 2: 'sub_obj_title',
 3: 'vendor_name',
 4: 'doc_ref_no_prefix_definition',
 5: 'contract_description',
 6: 'payment_method',
 7: 'weekday',
 8: 'day',
 9: 'month',
 10: 'year',
 11: 'transaction_amount',
 12: 'amount_sign'}

Next, we instantiate a LabelEncoder to encode the categorical features, replacing strings with nominal numerical values. This will allow us to set the color of plot markers to the different values of categorical features:

In [15]:
le = defaultdict(LabelEncoder)
In [16]:
# Label encoding
Y = df[categorical].apply(lambda x: le[x.name].fit_transform(x))
Y = pd.concat([Y, df[numerical]], axis=1)
In [17]:
Y.head()
Out[17]:
document_no department_title character_title sub_obj_title vendor_name doc_ref_no_prefix_definition contract_description payment_method weekday day month year transaction_amount amount_sign
0 212404 13 4 104 1560 7 1120 1 4 11 0 1 66.82 1
1 106470 24 4 201 2894 7 1120 0 0 8 6 1 127.33 1
2 168343 23 4 32 3211 7 1120 1 0 24 9 0 454.20 1
3 119090 44 4 180 4274 9 1120 1 2 6 5 0 50.00 1
4 209 45 3 50 3959 8 1120 0 0 7 5 0 71.92 1

Now, for purposes of this demonstration, we will use a subset of the data for faster training:

In [30]:
# take a sample that will become our X matrix
X_sample = df.sample(n=128*373, random_state=1729).drop(
    columns='document_no')

# take another sample using the same random_state
# (and thus the same indices as X_sample) to recover labels later
Y_sample = Y.sample(n=128*373, random_state=1729)
In [27]:
X_sample.shape , Y_sample.shape
Out[27]:
((47744, 13), (47744, 14))

While we are at it, we'll also define a function to conveniently recover the labels from the label encoded Y matrix:

In [38]:
def recover_labels(Y):
    '''
    Recover data labels from tensor to pandas DataFrame
    by applying the inverse transform of the label encoder
    on the label encoded payments data

    Parameters:
        Y: torch.Tensor of the label encoded payments data

    Returns:
        recovered_Y: pandas DataFrame of payments data
        with interpretable labels instead of encoded numbers
    '''
    recovered_Y = pd.concat([
        pd.DataFrame(Y, columns=Y_sample.columns)[categorical]\
        .astype('int32').apply(lambda x: le[x.name].inverse_transform(x)), 
        pd.DataFrame(Y, columns=Y_sample.columns)[numerical]
    ], axis=1)

    return recovered_Y

Let's try this out. Here is the label encoded Y_sample ...

In [39]:
Y_sample.head()
Out[39]:
document_no department_title character_title sub_obj_title vendor_name doc_ref_no_prefix_definition contract_description payment_method weekday day month year transaction_amount amount_sign
214627 197810 25 4 84 5475 3 1120 1 2 1 7 1 250.00 1
46641 223236 25 4 84 3606 3 1120 1 2 17 8 1 150.00 1
119334 148015 39 4 179 4007 9 1461 1 2 5 10 0 853.00 1
17132 7159 21 3 100 4688 9 1560 0 0 28 5 0 14.63 1
131162 155558 19 3 96 2205 2 1120 1 2 19 10 0 25.62 1

... and here is the same Y_sample with the labels recovered:

In [41]:
recover_labels(Y_sample).head()
Out[41]:
document_no department_title character_title sub_obj_title vendor_name doc_ref_no_prefix_definition contract_description payment_method weekday day month year transaction_amount amount_sign
214627 CHEK17101085 MANAGING DIRECTOR PURCHASE OF SERVICES LEGAL SERVICES YALONDA E HOUSTON court appointed attorneys None CHEK Thursday 02 March 2017 250.00 1
46641 CHEK17133085 MANAGING DIRECTOR PURCHASE OF SERVICES LEGAL SERVICES NGHI D VO court appointed attorneys None CHEK Thursday 18 May 2017 150.00 1
119334 CHEK17037586 OFFICE OF HOUSING PURCHASE OF SERVICES RENTS PITNEY BOWES INCORPORATED procurement Rental of Pitney Bowes Mailing Equipment CHEK Thursday 06 October 2016 853.00 1
17132 ACHD17013816 HEALTH MATERIALS AND SUPPLIES OFFICE MATERIALS AND SUPPLIES STAPLES BUSINESS ADVANTAGE procurement STAPLES CONTRACT & COMMERCIAL INC Bid # 010615NJ ACHD Friday 29 July 2016 14.63 1
131162 CHEK17047593 FLEET MANAGEMENT MATERIALS AND SUPPLIES MOTOR VEHICLE PARTS AND ACCESSORIES IEH AUTO PARTS LLC auto parts None CHEK Thursday 20 October 2016 25.62 1

Next, let's one-hot-encode the categorical features (pd.get_dummies encodes only the string-valued columns, leaving transaction_amount and amount_sign as-is):

In [31]:
%%time
# create dummy variables to one-hot-encode
X_sample = pd.get_dummies(X_sample, drop_first=True)
CPU times: total: 719 ms
Wall time: 892 ms

Partition the data into train and test sets:

In [33]:
%%time
# create training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X_sample, Y_sample, test_size=128*73, random_state=1729)
CPU times: total: 1.06 s
Wall time: 1.12 s

Min-max scale transaction amount:

In [34]:
# min-max scale the transaction amount
X_train.transaction_amount = MinMaxScaler().fit_transform(
    X_train.transaction_amount.to_numpy().reshape(-1,1))

X_test.transaction_amount = MinMaxScaler().fit_transform(
    X_test.transaction_amount.to_numpy().reshape(-1,1))

Convert pandas DataFrames into Pytorch FloatTensors:

In [35]:
# create tensors
X_train = torch.FloatTensor(X_train.values)
X_test = torch.FloatTensor(X_test.values)
Y_train = torch.FloatTensor(Y_train.values)
Y_test = torch.FloatTensor(Y_test.values)

Combine features and labels for each of the train and test sets into their respective PyTorch DataLoader objects:

In [37]:
# combine X_ and Y_ into a single dataset
train_set = TensorDataset(X_train, Y_train)
train_set = DataLoader(train_set, 
                       batch_size=128, 
                       shuffle=True)

test_set = TensorDataset(X_test, Y_test)
test_set = DataLoader(test_set, 
                      batch_size=128, 
                      shuffle=True)

Finally, we can save these train and test DataLoaders to our working directory for fast retrieval later on. torch.save() uses the same byte-serialization process (pickling) that the joblib and pickle modules use:

In [42]:
data = [train_set, test_set]
torch.save(data, 'data/philly_payments_clean')
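Retrieving the serialized DataLoaders later is a one-line torch.load(). Here is a self-contained round-trip sketch using toy loaders instead of the payments data (the weights_only=False flag is needed on recent PyTorch versions to unpickle non-tensor objects such as DataLoaders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# round-trip sketch: serialize two small DataLoaders and restore them
toy_train = DataLoader(TensorDataset(torch.zeros(4, 2)), batch_size=2)
toy_test = DataLoader(TensorDataset(torch.ones(2, 2)), batch_size=2)
torch.save([toy_train, toy_test], 'toy_loaders.pt')

# restore; on older PyTorch versions the weights_only argument can be omitted
loaded_train, loaded_test = torch.load('toy_loaders.pt', weights_only=False)
```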

Modeling & Evaluation

Now that we have preprocessed the data, let's model it. We can define some autoencoder classes with the help of PyTorch, a popular open source library for deep learning. A brief note about PyTorch: it builds a directed acyclic graph (DAG) of the neural network's operations at each iteration, and its autograd engine computes gradients of the operations recorded in that graph for efficient parameter updates during training.

Returning to our objective, if we can project financial transaction vectors into a subspace that captures most of the variance, we'll have organized the data in such a way that transactions are separated into concisely summarized groups representing the whole population. This representation might provide a better perspective of the underlying patterns in the data or amplify anomalous items to facilitate detection. What, then, is the best low-dimensional representation of our population, and how do we find it?

In general, we can use unsupervised learning to identify the latent (underlying) structure of our data. Popular unsupervised learning techniques include principal components analysis (PCA), singular value decomposition (SVD), and t-SNE. Another technique for learning low-dimensional representations is through training an autoencoder.
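For a sense of how little code a linear baseline requires, here is a 2-D PCA projection; this sketch uses random stand-in data rather than the payments matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# stand-in for the one-hot-encoded payments matrix
rng = np.random.default_rng(1729)
X = rng.normal(size=(1000, 50))

# linear projection onto the two directions of greatest variance
Z = PCA(n_components=2).fit_transform(X)
```

An autoencoder plays the same dimensionality-reduction role but, thanks to its nonlinear layers, can learn a richer latent structure than PCA's linear projection.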

Figure 1: Traditional Autoencoder Architecture

Autoencoders are considered an unsupervised technique in that learning happens without respect to known target labels. Figure 1 shows the typical structure of an autoencoder, consisting of two networks: an encoder and a decoder. These two networks meet in the middle at a matrix $Z$, the latent representation we are interested in.

To train an autoencoder, we begin by embedding transactional data in a vector space, represented as the matrix $X$. Batch samples of $X$ are passed through the encoder, which maps $X$ to the matrix $Z$. $Z$ is then passed through the decoder to reconstruct the original input as $X^*$. The difference between $X$ and its reconstruction $X^*$ is the reconstruction loss, defined as follows:

$\mathcal{L}_{RE} = ||x - d(e(x))||^{2}_{2}$

where:

$d(e(x)) =$ the reconstructed output of the composite encoder and decoder network

$||*||^{2}_{2} = $ the L2-norm squared or the sum of squared errors between $X$ and $X^*$
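A minimal encoder/decoder pair along these lines might look as follows. This is only a sketch, not the models.py implementation used later, though the layer widths (two 512-unit hidden layers around a 2-dimensional latent layer) match the description there:

```python
import torch
import torch.nn as nn

class MiniAE(nn.Module):
    """Sketch of an autoencoder: input -> 512 -> 2 -> 512 -> input."""
    def __init__(self, input_dims, latent_dims=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dims, 512), nn.ReLU(),
            nn.Linear(512, latent_dims))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dims, 512), nn.ReLU(),
            nn.Linear(512, input_dims), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # latent representation Z
        return self.decoder(z)       # reconstruction X*

torch.manual_seed(0)
x = torch.rand(8, 100)               # toy batch of 100-dimensional inputs
model = MiniAE(input_dims=100)
x_star = model(x)
loss = ((x - x_star) ** 2).sum()     # reconstruction loss: sum of squared errors
```

Minimizing this loss by gradient descent forces the 2-dimensional bottleneck $Z$ to retain as much information about the input as possible.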

Alternatively, we could use a variational autoencoder to model the data. This kind of autoencoder is distinct from a traditional autoencoder in that there are generally two output layers of the encoder that represent the mean and standard deviation of a distribution. Further, the latent matrix $Z$ is determined by sampling from a Gaussian distribution parameterized by the means and standard deviations sampled from the encoder's output layers, as seen below:

Variational Autoencoder Architecture

With variational autoencoders, it is common to add a term to the loss function that penalizes the learned parameters for having divergent means and small standard deviations. This term is known as the Kullback–Leibler (KL) divergence, a measure of how much one probability distribution differs from another: the smaller the divergence, the more the two distributions overlap. The KL divergence between the learned posterior $p(z \mid x) = \mathcal{N}(\mu, \sigma^{2})$ and $\mathcal{N}(0, 1)$ can be defined as

$\mathbb{KL}\left(\mathcal{N}(\mu, \sigma^{2}) \parallel \mathcal{N}(0, 1)\right) = \frac{1}{2}\sum\left(\sigma^{2} + \mu^{2} - 1 - \log\sigma^{2}\right)$

Source: Alexander Van de Kleut, Variational Autoencoders (VAE) with PyTorch

We combine the $\mathbb{KL}$ loss with the sum of squared errors from the traditional autoencoder to define the loss function for the variational autoencoder:

$\mathcal{L}_{VAE} = ||x - d(e(x))||^{2}_{2} + \gamma\,\mathbb{KL}\left(p(z \mid x) \parallel \mathcal{N}(0, 1)\right)$

where

$\gamma$ (gamma) is a tunable parameter that adjusts the weight given to the $\mathbb{KL}$ term during optimization
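Putting the two terms together, the VAE loss can be sketched as below. This assumes the encoder outputs `mu` and `log_var` (a common parameterization) and uses the closed-form KL term for a diagonal Gaussian against $\mathcal{N}(0, 1)$:

```python
import torch

def vae_loss(x, x_hat, mu, log_var, gamma=1.0):
    """Sum-of-squared-errors reconstruction loss plus gamma-weighted KL divergence."""
    recon = ((x - x_hat) ** 2).sum()
    # KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1 - log_var).sum()
    return recon + gamma * kl

x = torch.rand(4, 10)
x_hat = torch.rand(4, 10)
mu = torch.zeros(4, 2)        # encoder means (toy values)
log_var = torch.zeros(4, 2)   # encoder log-variances (toy values)
loss = vae_loss(x, x_hat, mu, log_var, gamma=0.001)
```

With `mu = 0` and `log_var = 0` the KL term vanishes, since the posterior already matches the standard normal prior.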

With that, let's implement this and explore the results. I prepared a module called models.py that contains the AE and VAE classes we'll be using. Both are simple networks, with two 512-dimensional hidden intermediary layers and one 2-dimensional latent layer. The VAE has two additional 2-dimensional layers to sample from. I'll run the preprocessing.py module here to demonstrate that as well:

In [2]:
from scripts.models import AE, VAE
from scripts.preprocessing import Preprocessor

df = pd.read_csv('data/city_payments_fy2017.csv')
pre = Preprocessor()
data = pre.process(df)
Cleaning data...
Label encoding the data...
Sampling from the original data...
One-hot-encoding the data...
Generating tensors...
Combining tensors into one dataset...
Preprocessing complete!
Total Time: 0.25 minutes

Now we train a traditional autoencoder:

In [7]:
latent_dims = 2
input_dims = data.X_train.shape[1]
output_dims = data.X_train.shape[1]

model1 = AE(input_dims, latent_dims, output_dims, batch_norm=False)
model1.fit(data, epochs=20)
epoch: [1/20] | loss: 710.4146 | elapsed time: 0.55 minutes
epoch: [2/20] | loss: 634.1962 | elapsed time: 0.53 minutes
epoch: [3/20] | loss: 620.7513 | elapsed time: 0.53 minutes
epoch: [4/20] | loss: 601.9883 | elapsed time: 0.53 minutes
epoch: [5/20] | loss: 593.5554 | elapsed time: 0.53 minutes
epoch: [6/20] | loss: 565.4938 | elapsed time: 0.54 minutes
epoch: [7/20] | loss: 555.7859 | elapsed time: 0.54 minutes
epoch: [8/20] | loss: 577.5383 | elapsed time: 0.55 minutes
epoch: [9/20] | loss: 561.3287 | elapsed time: 0.61 minutes
epoch: [10/20] | loss: 574.3048 | elapsed time: 0.84 minutes
epoch: [11/20] | loss: 529.9324 | elapsed time: 1.0 minutes
epoch: [12/20] | loss: 516.5233 | elapsed time: 0.85 minutes
epoch: [13/20] | loss: 528.6144 | elapsed time: 0.74 minutes
epoch: [14/20] | loss: 575.4102 | elapsed time: 0.83 minutes
epoch: [15/20] | loss: 531.4222 | elapsed time: 1.0 minutes
epoch: [16/20] | loss: 527.0771 | elapsed time: 1.08 minutes
epoch: [17/20] | loss: 534.6619 | elapsed time: 1.1 minutes
epoch: [18/20] | loss: 516.548 | elapsed time: 0.95 minutes
epoch: [19/20] | loss: 533.3744 | elapsed time: 0.82 minutes
epoch: [20/20] | loss: 511.7211 | elapsed time: 0.72 minutes
In [18]:
model1
Out[18]:
AE(
  (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (linear1): Linear(in_features=4845, out_features=512, bias=True)
  (linear2): Linear(in_features=512, out_features=2, bias=True)
  (linear3): Linear(in_features=2, out_features=512, bias=True)
  (linear4): Linear(in_features=512, out_features=4845, bias=True)
)

In the AE.fit() method, we train the autoencoder, but we also include a loop to validate the model on test data unseen during training. Here is a plot of both train and test loss:

In [8]:
model1.plot_loss()

We observe that the loss drops dramatically after only 20 batches, then declines very gradually. Let's see what the embedding looks like:

In [10]:
model1.plot_projection(data, which='test_set')
In [13]:
model2 = AE(input_dims, latent_dims, output_dims, batch_norm=True)
model2.fit(data, epochs=20)
epoch: [1/20] | loss: 708.7727 | elapsed time: 0.55 minutes
epoch: [2/20] | loss: 686.2775 | elapsed time: 0.54 minutes
epoch: [3/20] | loss: 656.3316 | elapsed time: 0.54 minutes
epoch: [4/20] | loss: 654.1249 | elapsed time: 0.54 minutes
epoch: [5/20] | loss: 677.3488 | elapsed time: 0.54 minutes
epoch: [6/20] | loss: 645.1639 | elapsed time: 0.54 minutes
epoch: [7/20] | loss: 649.9794 | elapsed time: 0.54 minutes
epoch: [8/20] | loss: 594.2822 | elapsed time: 0.54 minutes
epoch: [9/20] | loss: 619.0198 | elapsed time: 0.54 minutes
epoch: [10/20] | loss: 637.7162 | elapsed time: 0.54 minutes
epoch: [11/20] | loss: 633.1409 | elapsed time: 0.54 minutes
epoch: [12/20] | loss: 645.791 | elapsed time: 0.54 minutes
epoch: [13/20] | loss: 606.5604 | elapsed time: 0.54 minutes
epoch: [14/20] | loss: 592.6423 | elapsed time: 0.54 minutes
epoch: [15/20] | loss: 612.3545 | elapsed time: 0.54 minutes
epoch: [16/20] | loss: 592.6324 | elapsed time: 0.54 minutes
epoch: [17/20] | loss: 585.6572 | elapsed time: 0.54 minutes
epoch: [18/20] | loss: 598.5024 | elapsed time: 0.53 minutes
epoch: [19/20] | loss: 589.979 | elapsed time: 0.55 minutes
epoch: [20/20] | loss: 557.442 | elapsed time: 0.54 minutes
In [14]:
model2.plot_loss()
In [17]:
model2.plot_projection(data, which='test_set')
In [37]:
model3 = VAE(input_dims, latent_dims, output_dims, batch_norm=True)
model3.fit(data, epochs=20, gamma=1)
epoch: [1/20] | loss: 1726.2412 | elapsed time: 0.54 minutes
epoch: [2/20] | loss: 1203.718 | elapsed time: 0.56 minutes
epoch: [3/20] | loss: 1166.6509 | elapsed time: 0.55 minutes
epoch: [4/20] | loss: 1075.7648 | elapsed time: 0.58 minutes
epoch: [5/20] | loss: 1021.8521 | elapsed time: 0.55 minutes
epoch: [6/20] | loss: 1015.1688 | elapsed time: 0.59 minutes
epoch: [7/20] | loss: 1038.255 | elapsed time: 0.55 minutes
epoch: [8/20] | loss: 1014.0497 | elapsed time: 0.55 minutes
epoch: [9/20] | loss: 1032.7101 | elapsed time: 0.54 minutes
epoch: [10/20] | loss: 1018.7817 | elapsed time: 0.56 minutes
epoch: [11/20] | loss: 1019.3304 | elapsed time: 0.57 minutes
epoch: [12/20] | loss: 1017.0781 | elapsed time: 0.55 minutes
epoch: [13/20] | loss: 1009.1345 | elapsed time: 0.54 minutes
epoch: [14/20] | loss: 1009.4597 | elapsed time: 0.55 minutes
epoch: [15/20] | loss: 1011.3191 | elapsed time: 0.57 minutes
epoch: [16/20] | loss: 1022.9342 | elapsed time: 0.55 minutes
epoch: [17/20] | loss: 1010.7504 | elapsed time: 0.54 minutes
epoch: [18/20] | loss: 1001.7854 | elapsed time: 0.54 minutes
epoch: [19/20] | loss: 1041.136 | elapsed time: 0.55 minutes
epoch: [20/20] | loss: 980.733 | elapsed time: 0.54 minutes
In [38]:
model3.plot_loss()
In [41]:
model3.plot_projection(data, which='test_set')
In [10]:
model4 = VAE(input_dims, latent_dims, output_dims, batch_norm=True)
model4.fit(data, epochs=20, gamma=0.001)
epoch: [1/20] | loss: 762.7522 | elapsed time: 0.56 minutes
epoch: [2/20] | loss: 695.9151 | elapsed time: 0.54 minutes
epoch: [3/20] | loss: 684.8369 | elapsed time: 0.54 minutes
epoch: [4/20] | loss: 645.6278 | elapsed time: 0.54 minutes
epoch: [5/20] | loss: 640.1102 | elapsed time: 0.55 minutes
epoch: [6/20] | loss: 589.3079 | elapsed time: 0.55 minutes
epoch: [7/20] | loss: 627.2583 | elapsed time: 0.55 minutes
epoch: [8/20] | loss: 632.123 | elapsed time: 0.55 minutes
epoch: [9/20] | loss: 603.3551 | elapsed time: 0.55 minutes
epoch: [10/20] | loss: 619.6804 | elapsed time: 0.56 minutes
epoch: [11/20] | loss: 639.6083 | elapsed time: 0.55 minutes
epoch: [12/20] | loss: 617.1934 | elapsed time: 0.55 minutes
epoch: [13/20] | loss: 556.191 | elapsed time: 0.55 minutes
epoch: [14/20] | loss: 625.5158 | elapsed time: 0.55 minutes
epoch: [15/20] | loss: 616.4233 | elapsed time: 0.55 minutes
epoch: [16/20] | loss: 572.6877 | elapsed time: 0.55 minutes
epoch: [17/20] | loss: 578.4244 | elapsed time: 0.55 minutes
epoch: [18/20] | loss: 605.4987 | elapsed time: 0.55 minutes
epoch: [19/20] | loss: 577.6849 | elapsed time: 0.55 minutes
epoch: [20/20] | loss: 562.2061 | elapsed time: 0.55 minutes
In [11]:
model4.plot_loss()
In [14]:
model4.plot_projection(data, which='test_set')

Finally, we can recover the labels and sort the transactions by reconstruction loss to identify potential outliers:

In [66]:
recovered_data = pd.concat([
    data.recover_labels(model4.labels[:,:-1]), 
    pd.DataFrame(model4.labels[:,-1].numpy(), 
                 columns=['reconstruction_loss'])
], axis=1)

top_ten = recovered_data.sort_values(by='reconstruction_loss', ascending=False).head(10).copy()
In [77]:
top_ten.transaction_amount = top_ten.transaction_amount.apply(lambda x: '%.2f' % x)
top_ten
Out[77]:
document_no department_title character_title sub_obj_title vendor_name doc_ref_no_prefix_definition contract_description payment_method weekday day month year transaction_amount amount_sign reconstruction_loss
661 CHEK17009282 SINKING FUND COMMISSION DEBT SERVICE PRINCIPAL PAYMENTS ON CITY DEBTL/T WATER & SEWER REVENUE BOND SINKING FUND payment voucher None CHEK Thursday 28 July 2016 62040000.00 1.0 13.881687
1118 ACHD17170973 COMMERCE REAL PROPERTY PROF SRVAIRFIELD RUNWAYS/IMPROVEM WESTON SOLUTIONS, INC. professional services RENEWAL TWO ACHD Friday 19 May 2017 12361.93 1.0 13.420626
1089 CHEK17004136 FIRE MATERIALS AND SUPPLIES DRY GOODS/NOTIONS/WEARING APPAREL ADAM K THIEL reimbursable expense vouchers for City employees None CHEK Thursday 14 July 2016 838.34 1.0 13.157266
1007 CHEK17094091 MDO-OFFICE OF TECHNOLOGY REAL PROPERTY COMPUTER EQUIPMENT NU VISION TECHNOLOGIES LLC procurement Purchase and Installation of Telephone Hardwar... CHEK Thursday 09 February 2017 176886.88 1.0 12.605608
698 ACHD17185928 WATER PURCHASE OF SERVICES ARCHITECTURAL & ENGINEERING SRVCS HAZEN AND SAWYER professional services Capital Improvements ACHD Friday 16 June 2017 49153.67 1.0 12.494843
437 CHEK17028129 STREETS EQUIPMENT VEHICLES MOTOR AND MOTORLESS CLEVELAND BROTHERS EQUIPMENT CO procurement CLEVELAND BROTHERS EQUIPMENT CO Bid # 00011420 CHEK Thursday 08 September 2016 135744.00 1.0 12.241909
239 CHEK17073038 RECREATION REAL PROPERTY PROF SRVSWIMMING POOLS BUELL KRATZER POWELL professional services Architectural Requirement CHEK Thursday 15 December 2016 35543.48 1.0 12.149884
407 ACHD17173297 OFFICE OF BH/MR SERVICES MATERIALS AND SUPPLIES PRECISION PHOTOGRAPHIC AND ARTISTS XEROX CORPORATION payment voucher None ACHD Friday 26 May 2017 70.00 1.0 12.106859
537 CHEK17063829 OFFICE OF SUPPORTIVE HOUSING REAL PROPERTY PROF SRVEMER FIRE SAFETY EQUIP BURRIS ENGINEERS, INC. professional services Engineering Requirements CHEK Friday 25 November 2016 804.38 1.0 11.962860
913 JEXX17000388 CITY CONTROLLER PURCHASE OF SERVICES SEMINAR AND TRAINING SESSIONS AMERICAN CONTRACT COMPLIANCE ASSOCIATION None None JEXX Tuesday 23 August 2016 650.00 1.0 11.815714
In [78]:
report = dp.Report(
  dp.Text('## Top Ten Potential Outlier Transactions'),
  dp.Text('### Ranked by Reconstruction Loss'),
  dp.Table(top_ten),
)
report.upload(name='potential_outliers', visibility=dp.Visibility.DEFAULT)

Uploading report and associated data - please wait...


Report successfully uploaded. View and share your report here, or edit your report here.

In [42]:
report = dp.Report(dp.Plot(model4.plot_projection(data, which='test_set')))
report.save('vae_batch_norm_gamma_1_proj.html', open=True)
report.upload(name='vae_batch_norm_gamma_1_proj', 
              visibility=dp.Visibility.DEFAULT,
              formatting=dp.ReportFormatting(
                  width=dp.ReportWidth.NARROW)
             )

Report saved to ./vae_batch_norm_gamma_1_proj.html

Uploading report and associated data - please wait...


Report successfully uploaded. View and share your report here, or edit your report here.

Conclusion¶

Let's return to our original motivation, which is to learn a new way of seeing financial transactions in order to identify a representative sample, or even to focus sampling on areas of higher risk. We saw that modeling transactions with a traditional autoencoder, as well as a variational autoencoder, can facilitate this by learning a lower-dimensional representation of the data that disentangles features of interest, including some we may not have anticipated. Through experimentation we identified the various parameters we can tune and how to improve model performance. We also looked at how to rank transactions by their reconstruction loss to identify potential outliers.
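The ranking step can be sketched in a few lines. This is a minimal, self-contained illustration using random toy data in place of the trained VAE's reconstructions: the arrays, the injected outlier rows, and the loss definition (per-row mean squared error) here are illustrative assumptions, not the notebook's actual model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in the notebook, `x` would be the encoded transactions
# and `x_hat` the trained VAE's reconstructions of them.
x = rng.normal(size=(1000, 16))
x_hat = x + rng.normal(scale=0.1, size=x.shape)
x_hat[[3, 42, 7]] += 2.0  # inject three obvious outliers for illustration

# Per-transaction reconstruction loss: mean squared error per row
loss = ((x - x_hat) ** 2).mean(axis=1)

# Rank descending and take the top ten as potential outliers
top_ten_idx = np.argsort(loss)[::-1][:10]
print(top_ten_idx[:3])
```

The injected rows dominate the ranking because their reconstruction error is orders of magnitude larger than the background noise, which is exactly the behavior we rely on when flagging unusual transactions.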

Challenges and opportunities going forward include applying this approach to larger data sets, which would almost certainly require distributed computation using a framework such as Spark or Hadoop. It would also be useful to deploy this kind of pipeline as a secure web application where businesses could upload their data for automatic embedding and dashboards.